The data analysed here contain physicochemical and sensory information on white Vinho Verde wine. Vinho Verde, meaning “green wine”, is produced in the Minho region of northwest Portugal. The word “green” refers to the wine’s young age, not its color. The wine’s most important features are its youth and freshness, so it should be consumed soon after bottling.
Vinho Verde can be white, red, or rosé, and white is the most widely produced kind. The dataset explored here includes 11 laboratory-tested physicochemical values for each wine and one sensory score given by human experts.
The next several sections explore some important patterns and correlations among these variables. The exploration focuses on one question: what makes Vinho Verde taste good?
Number of wines in the dataset:
## [1] 4898
Features of the data. Among the variables, X is the item number. The variables from fixed.acidity to alcohol are the quantitative results of the physicochemical tests. The last variable, quality, is the subjective grade given by human experts. It takes values from 1 (very bad) to 10 (excellent).
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Summary of the physicochemical results in the dataset. All of these variables are numerical. As the summary shows, some variables are very stable while others vary widely. For example, density is nearly the same for all wines (0.9871 to 1.0390), whereas residual sugar ranges from 0.6 to 65.8. Some variables also tell us something about the wine. pH is below 4 for every wine, which shows that the wine is acidic. The wide range of sugar levels suggests that some wines taste very sweet (the sweetness may come from the fruit).
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
The most important variable is quality; the chemical measurements are of interest mainly for what they say about it. This section therefore analyzes the distribution of quality across all wines.
Quality is defined on ten grades assigned by human experts, so the grades in this dataset are treated as an ordinal factor. The counts and the bar plot of the grades show that, although ten grades (1 to 10) are available, no wine was graded 1, 2, or 10. In other words, no expert gave any of the 4898 white Vinho Verde wines a very bad grade or a perfect score. The absence of very bad grades may be explained by the fact that all Vinho Verde is produced in the same area under similar conditions. The lack of excellent grades (note that even grade 9 is very rare, with only 5 of 4898 wines) may be a result of how Vinho Verde is marketed. Vinho Verde is regarded as inexpensive and fun, and is not usually treated as a serious, high-end wine. I believe this may make experts unwilling to grant it very high grades.
##
## 1 2 3 4 5 6 7 8 9 10
## 0 0 20 163 1457 2198 880 175 5 0
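The counts above can be reproduced with a short base R sketch. Since the original code and data file are not shown, the `quality` vector here is rebuilt from the printed counts rather than read from the data; the names `grade_counts`, `quality_f`, and `middle_share` are illustrative only.

```r
# Grade counts copied from the table above (grades 1 to 10).
grade_counts <- c(0, 0, 20, 163, 1457, 2198, 880, 175, 5, 0)

# Rebuild one grade per wine; this stands in for the data frame's quality column.
quality <- rep(1:10, times = grade_counts)

# An ordered factor with all ten levels keeps the empty grades in the table.
quality_f <- factor(quality, levels = 1:10, ordered = TRUE)

grade_table <- table(quality_f)
print(grade_table)

# Share of wines graded 5, 6, or 7.
middle_share <- sum(grade_table[c("5", "6", "7")]) / length(quality)
print(round(middle_share, 3))
```

Declaring all ten levels explicitly is what makes grades 1, 2, and 10 appear with a count of zero instead of being dropped from the table.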
The pie chart of wine quality follows. Grades 5, 6, and 7 account for most of the wines (4535 of 4898). The number of wines graded below 5 or above 7 is very small compared with the middle grades.
These features suggest that the grades of Vinho Verde are highly concentrated. Based on these observations, I divided the wines into three categories at the sharp drops in counts from grade 5 to 4 and from grade 7 to 8. I named the categories “low” (grades 3 and 4), “normal” (grades 5, 6, and 7), and “high” (grades 8 and 9). As the table and the pie chart of the three categories show, low and high contain almost the same number of wines, and most wines are of normal quality.
##
## low normal high
## 183 4535 180
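One way to build the three categories is with `cut()`, sketched below. The `quality` vector is again rebuilt from the grade counts printed earlier, and the break points are an assumption chosen to match the grouping described in the text; the original code may have done this differently.

```r
# Grade counts for grades 1 to 10, copied from the quality table.
grade_counts <- c(0, 0, 20, 163, 1457, 2198, 880, 175, 5, 0)
quality <- rep(1:10, times = grade_counts)

# Right-closed intervals: (2,4] -> low, (4,7] -> normal, (7,9] -> high.
category <- cut(quality,
                breaks = c(2, 4, 7, 9),
                labels = c("low", "normal", "high"),
                ordered_result = TRUE)

category_table <- table(category)
print(category_table)
```

Using `ordered_result = TRUE` keeps the categories ordinal, so later plots and tables list them as low, normal, high rather than alphabetically.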
Next are histograms of all the chemical variables. Two features are worth mentioning. First, the distributions of most variables follow a bell curve with a few outliers on the right side. Second, the distributions of residual.sugar and alcohol are far from a standard bell curve.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Following the univariate analysis above, the remaining analysis works on two levels: categories and sub-categories. The category-level analysis aims to distinguish low, normal, and high quality Vinho Verde wines. The sub-category analysis aims to identify differences between grades within the same category, mainly between grades 5, 6, and 7 in the normal category.
The pairs plot of these variables shows some interesting properties besides their relationships with wine quality mentioned in the previous section. The first is the correlation between variables. The plot shows a linear correlation between density and alcohol: wines with high alcohol tend to have low density. A similar but less apparent relationship exists between alcohol and chlorides. Correlations like these indicate that the variables involved may contribute to wine quality in the same manner.
Next is a comparison of the 11 normalized physicochemical variables. Some variables have many outliers (chlorides and volatile.acidity) and some have few (alcohol and density). The patterns of the outliers are also interesting. Some outliers are spread over a broad range; these come from chlorides, volatile.acidity, and free.sulfur.dioxide. Others are clustered in a relatively small range; these come from pH, sulphates, and fixed.acidity. The broadly spread outliers may reflect varying levels of a chemical substance, which occurs in many chemical measurements. The clustered outliers can tell a more informative story about the chemical composition of the wines. The theory is that the majority of white Vinho Verde wines show a normal level of a certain chemical feature, while a minority show a distinctly different level.
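As a small illustration of how outliers can be flagged on a normalized variable, the sketch below applies z-score scaling and the usual boxplot rule. Both choices are assumptions, since the normalization method behind the plot is not stated. The input is only the first ten residual.sugar values shown in the structure listing, so the result says nothing about the full dataset.

```r
# First ten residual.sugar values, copied from the structure listing above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# z-score normalization (an assumption; the original method is not stated).
z <- as.numeric(scale(residual_sugar))

# Boxplot rule: points more than 1.5 times the hinge spread beyond a hinge.
out <- boxplot.stats(z)$out
is_outlier <- z %in% out

print(data.frame(residual_sugar, z = round(z, 2), is_outlier))
```

Because z-scoring is a linear rescaling, it changes the axis but not which points the boxplot rule flags; its benefit is that variables with very different units can share one plot.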
Let us begin with the variables whose outliers cluster together: pH, sulphates, and fixed.acidity. The following figures show that the quality categories may differ in pH, while the differences in the other two variables are less obvious. The pH values support the theory above: a cluster of wines has high pH, while most wines have lower pH. Furthermore, the wines with high pH tend to belong to the high quality category. One more subtle observation needs further examination. For fixed.acidity, there is a small peak at the front of the density curve of the high quality category. This could indicate distinct sub-levels within high quality.
The sub-category boxplots of pH and fixed.acidity follow. As the boxplots reveal, the grade of the wines varies with pH, and the highest grade (grade 9) has more fixed acidity. However, any analysis involving grade 9 should be carried out carefully, since only 5 wines are rated 9.
The remaining 8 variables are also analyzed; their density plots follow. The variable free.sulfur.dioxide helps to discriminate low quality from the other categories. The differences between high quality and the other categories can be seen in chlorides, density, and alcohol.
The bivariate analysis identified several variables as useful for analyzing wine quality: pH, free.sulfur.dioxide, chlorides, density, and alcohol. The following analysis concentrates on the relationships between these variables and wine quality.
The next plot shows some relationships between wine quality and the variables pH and alcohol. Because of the large number of normal quality wines, the visualization is not entirely clear. However, the category medians still show some variation in pH and alcohol. First, high quality wines tend to have high alcohol and low quality wines low alcohol. Second, it is very hard to tell any difference in pH between low and normal quality, while the pH of high quality wines is slightly higher than that of the other two categories.
The following figure helps to distinguish low quality wine from the other two categories using free.sulfur.dioxide. Most low quality wines clearly have low free.sulfur.dioxide, and high quality wines have slightly higher values. It should also be pointed out that the variation of free.sulfur.dioxide in low quality wines is larger than in the other two categories. Most low quality wines have relatively low free.sulfur.dioxide, but a few have the highest values in the whole collection of Vinho Verde wines.
A similar plot can be drawn for wine quality against chlorides and density. Many high quality wines are located in the lower left corner of the plot: their density and chlorides values are lower than those of the other two categories. The two values for low quality wines do not stand out from the whole collection.
Further exploration of the data is needed to clarify the relationships above. The difficulty arises from the imbalance between the number of normal quality wines and the other two categories. I addressed this problem with ratios. Since most wines are of normal quality, I compare the ratio of low or high quality wines to normal quality wines in different intervals of each variable.
The following plot compares the ratios of high quality and low quality wines across intervals of pH. The x axis shows the pH intervals, which I have mapped to integers. These integers indicate only the relative order of the intervals. For example, integer 2 may indicate interval (4,4.5] and integer 1 interval (3.5,4]. The y axis is the ratio of the high or low quality count to the normal quality count in the corresponding interval. As the plot shows, the high quality ratio fluctuates more than the low quality ratio across the whole range of pH. This means that the ratio of low quality to normal quality counts is relatively stable as pH varies, while for high quality wine the ratio is very high when pH is high. There are two clear signals in this picture. The first is the variation of the ratios, which shows the relationship between normal quality wines and the other two categories. The second is the contrast between the two lines, which shows the difference in chemical levels between low and high quality wines.
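The ratio calculation described above can be sketched in base R as follows. The data here are simulated purely to make the example runnable: the category proportions, the pH distribution, the number of intervals, and all variable names are assumptions, and the printed ratios do not reflect the wine data.

```r
set.seed(1)  # simulated data only; not the wine measurements

# Hypothetical stand-ins for the real category and pH columns.
n <- 3000
category <- factor(sample(c("low", "normal", "high"), n, replace = TRUE,
                          prob = c(0.04, 0.92, 0.04)),
                   levels = c("low", "normal", "high"))
ph <- round(rnorm(n, mean = 3.19, sd = 0.15), 2)

# Split pH into equal-width intervals and count wines per interval and category.
ph_bin <- cut(ph, breaks = 6)
counts <- table(ph_bin, category)

# Ratio of a category's count to the normal count in the same interval
# (NA where an interval holds no normal quality wines).
normal_n <- as.vector(counts[, "normal"])
ratio_to_normal <- function(k) {
  ifelse(normal_n > 0, as.vector(counts[, k]) / normal_n, NA)
}

ratios <- data.frame(
  bin = seq_len(nrow(counts)),   # integer position used on the x axis
  interval = rownames(counts),   # the actual pH interval
  low = ratio_to_normal("low"),
  high = ratio_to_normal("high")
)
print(ratios)
```

Dividing by the normal count within each interval removes the effect of the much larger normal group, so the low and high curves can be compared on the same scale. Intervals at the extremes usually contain few wines, so their ratios are noisy and should be read with care.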
The next plot concerns free.sulfur.dioxide. From the plot, we can conclude that this variable is less useful for indicating high quality wine than low quality wine. For low quality wine, the values of free.sulfur.dioxide are concentrated in the lower intervals.
The plot for alcohol is even more informative. The curves for low and high quality wines both vary considerably across the whole range of alcohol. For low quality wines, the ratio is high at low alcohol values; for high quality wines, the ratio is high at high alcohol values.
The influence of chlorides on quality is shown in the following figure. Both high and low quality wines have low chlorides values, so it is difficult to distinguish low from high quality wines using this picture.
The most important fact is conveyed by this pie chart: almost all white Vinho Verde wines have similar quality. I think this result is due to the fact that the wines all come from a small area of Portugal and their quality is strictly controlled by the official Vinho Verde organization.
The following figure shows the relationship between alcohol and wine quality. The difference in alcohol between the quality categories is obvious: high quality wine tends to have high alcohol.
This figure addresses the imbalance between the counts of normal quality wines and the other two categories. The variation of the ratios clearly shows that the ratio of high quality to normal quality counts differs considerably between pH intervals. The comparison of the red and blue lines shows the difference between high and low quality wine. In this plot, high quality wine tends to fall in the high pH intervals. The ratios for low quality wines do not vary enough to make pH a good feature for predicting low quality wine.
Before I did this project, I knew nothing about wines. Now I have learned something about white Vinho Verde wines from the dataset.
The dataset consists of 4898 wines. The information on each wine comprises 11 measured physicochemical values and 1 grade given by human experts. I have explored the relationships between the physicochemical values and the grades.
The first problem I encountered was how to treat the most important variable, the grade of the wine. Although the grade is represented by integers (1 to 10), I think treating it as a numeric value could be misleading: the number itself carries no quantitative information about the grade, so I treat the grades as factors. Furthermore, the grade is given by humans, so the border between adjacent grades may be blurred. There may be strong evidence to distinguish grade 9 from grade 3, but the difference between grade 5 and grade 6 may be subtle and subjective. So, to show the differences between grades clearly, I combined the 7 observed grades (3 to 9) into 3 major categories.
I think the analysis of the single variable, quality, has already brought me plenty of information. First, most white Vinho Verde wines have similar grades. Second, few wines have terrible or perfect quality. I believe the consistent quality of Vinho Verde may be due to the quality control of the official Vinho Verde organization. The reason there is no perfect wine among the 4898 may be that Vinho Verde is regarded as cheap and fun, while a perfect wine is expected to be serious and expensive.
The relationships between the physicochemical values and the wine categories were also examined with boxplots and density plots. 5 of the 11 variables were identified as useful for distinguishing between categories. Note that the boxplots and density plots do not show the number of wines in each category. Since the accuracy and variance of the information conveyed by a plot depend heavily on the number of data points, the results from these plots should be examined in a more formal step such as hypothesis testing. For example, only 5 wines are rated grade 9, which is too few to give any definitive result.
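As one possible form such a formal step could take, the sketch below runs a Kruskal-Wallis rank-sum test followed by pairwise Wilcoxon tests. This is a suggestion, not something done in the analysis above. The alcohol values are simulated with arbitrary means and spreads so that the code runs on its own; only the group sizes (183, 4535, 180) come from the category table, and the printed p-values mean nothing for the real wines.

```r
set.seed(42)  # simulated values only; not the wine measurements

# Hypothetical alcohol readings; means and spreads are arbitrary choices.
alcohol <- c(rnorm(183, mean = 10.2, sd = 1.0),
             rnorm(4535, mean = 10.5, sd = 1.2),
             rnorm(180, mean = 11.6, sd = 1.3))
category <- factor(rep(c("low", "normal", "high"), times = c(183, 4535, 180)),
                   levels = c("low", "normal", "high"))

# Kruskal-Wallis rank-sum test: do the three groups share one distribution?
result <- kruskal.test(alcohol ~ category)
print(result)

# Pairwise follow-up with Holm correction for multiple comparisons.
pairwise <- pairwise.wilcox.test(alcohol, category, p.adjust.method = "holm")
print(pairwise)
```

A rank-based test is used here because several of the variables are far from bell-shaped and the groups are very unequal in size. With the real data, `alcohol` and `category` would be replaced by the corresponding columns.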
In the last part of the analysis, I used a method to clarify the differences between wines in the three categories. The main idea is to use a ratio to measure how often wines of a certain category appear in a particular interval of a variable. The imbalance between the number of normal quality wines and the other two categories makes it very hard to show the differences between them. I tried many methods to alleviate this problem, and the ratio worked best for presentation.